ABSTRACT

Large graphs arise in a number of contexts, and understanding
their structure and extracting information from them is an
important research area. Early algorithms for mining communities
focused on the global structure and often run in time that
depends on the size of the entire graph. Nowadays, as we often
explore networks with billions of vertices and find communities
of size in the hundreds, it is crucial to shift our attention
from macroscopic structure to microscopic structure when dealing
with large networks. A growing body of work has been adopting
local expansion methods in order to identify the community from
a few exemplary seed members.

In this paper, we propose a novel approach for finding
overlapping communities called LEMON (Local Expansion via
Minimum One Norm). Different from PageRank-like diffusion
methods, LEMON finds the community by seeking a sparse vector in
the span of the local spectra such that the seeds are in its
support. We show that LEMON can achieve the highest detection
accuracy among state-of-the-art proposals. The running time
depends on the size of the community rather than that of the
entire graph. The algorithm is easy to implement and highly
parallelizable.

Moreover, given that networks are not all similar in nature, a
comprehensive analysis of how well the local expansion approach
is suited for uncovering communities in different networks is
still lacking. We thoroughly evaluate our approach using both
synthetic and real-world datasets across different domains, and
analyze the empirical variations when applying our method to
inherently different networks in practice. In addition, we
provide heuristics on how the quality and quantity of the seed
set affect the performance.
1. INTRODUCTION

Analyzing the structure of complex networks and extracting
information from them is an important research area. Significant
research has been carried out on finding the structure of
networks and identifying communities [10].

In early work, researchers assumed that communities were
disjoint and had more internal connections than external
connections. Both assumptions have been discarded, since it is
clear that in most networks a vertex belongs to more than one
community. For instance, in social networks, one might belong to
a work community, a community of friends, and a community of
individuals that share the same hobby such as golf; in
co-purchase networks, one item might belong to multiple
categories. Also, since we are dealing with networks with
hundreds of millions of vertices, an individual in a community
of size 100¹ will certainly have more links outside the
community than inside. These key insights have motivated
researchers to identify communities from a new perspective.

Considerable research on detecting communities has focused on
the global structure. These globally based detection algorithms
usually run in time that depends on the size of the entire
graph, a major drawback in computational cost. Nowadays, we
explore networks with billions of vertices to find communities
of size around a hundred. Thus, taking the entire graph into
account might not serve as a practical solution in many
situations. It is crucial to shift our attention from
macroscopic structure to microscopic structure when dealing with
large networks, and to develop new approaches that enable
finding communities in time that depends on the size of the
community.

¹ A statistical study on social networks done by Leskovec et
al. has shown that real-world communities with high quality are
quite small and usually consist of no more than 100 vertices.

Quite recently, there has been growing interest in finding
communities by locally expanding an exemplary seed set in the
community of interest [6, 14]. This type of algorithm usually
starts with a few members that are already known to be in the
target community, and the goal is to uncover the remaining
members in the same community as the exemplary members. These
known members are usually referred to as seeds in the
literature, and the process of gradually growing the seed set
into a larger set until the target community is revealed is
called seed set expansion. The setting of seed set expansion
applies to many real-world applications. For example, in web
search, with a few known pages that share similar information,
we could generate a larger group of web pages that contain
content relevant to a certain search query; in product networks,
seed set expansion enables the automatic categorization of
products that are discovered to be in the same community as the
labeled items.

The random walk technique has been extensively adopted in the
literature as a subroutine for locally growing the seed set. The
dynamics of random walks are effective in finding a local
community since they make non-uniform expansion decisions based
on the structure revealed during the exploration of the
neighborhood surrounding the seeds. This implies that random
walk based local expansion is able to trace the community
members in a principled way that resembles the natural process
of forming the local community structure. Very recently, Abrahao
et al. also experimentally verified that, among various
algorithmic communities, those produced by random walks are the
most structurally similar to real-world communities.

In this paper we propose a novel approach called LEMON (Local
Expansion via Minimum One Norm)² for finding overlapping
communities in large networks. We systematically demonstrate
that LEMON achieves both high efficiency and effectiveness,
standing out significantly among state-of-the-art proposals.
Specifically, we consider the span of a few dimensions of
vectors after a short random walk and use it as the approximate
invariant subspace, which we refer to as the local spectra. What
distinguishes LEMON from previous PageRank-like diffusion
methods is that it uses this subspace rather than a single
probability vector. In contrast to traditional spectral
clustering methods, the local spectral method does not require
the burdensome computation of a large number of singular
vectors. In addition, as traditional spectral methods usually
partition the vertices into disjoint communities, we make
another fundamental change. Concretely, we mine the communities
from the subspace by seeking a sparse approximate indicator
vector in the span of the local spectra such that the seeds are
in its support. In practice, this can be achieved by solving an
$\ell_1$-penalized linear programming problem.

We aim to develop a comprehensive understanding of the local
spectral approach for identifying a community from a small seed
set. Following the central idea of our approach, we seek to
answer fundamentally important questions such as: what defines
“good” communities and when do they emerge as we expand the seed
set (Section 4.4)? How can a small community be found in time
that depends on the size of the community rather than that of
the entire graph (Section 4.3)? What defines “good” seeds and
how many seeds are needed to uniquely define a community
(Section 5)? And given that networks are not all similar in
nature, how well is the local expansion approach suited for
uncovering communities in different types of networks
(Section 6.4)?


We thoroughly evaluate our approach using both synthetic and
real-world datasets across different domains, and analyze the
empirical variations when applying our method to inherently
different networks in practice. We believe that the insights
gained from studying these problems will provide valuable
guidance for future investigation on this topic.

2. RELATED WORK 

A considerable amount of literature has been published on
finding communities in large social and information networks. We
highlight a few ideas that have recently emerged in the
literature to clarify how our method differs.

Globally based community finding algorithms. Various community
detection algorithms have been developed in the past decade, and
most of them fall into the category of global approaches. One
stream of global algorithms attempts to find communities by
optimizing an objective function. For example, OSLOM is based on
the optimization of a fitness function, which expresses the
statistical significance of clusters with respect to random
fluctuations (i.e., the random graph generated by the
configuration model [19] during community expansion). However,
the communities identified by mathematical construction may
structurally diverge from real communities, as pointed out in
prior work. Another main stream of research adopts the label
propagation approach [22], which defines rules that simulate the
spread of labels of vertices in the network. The DEMON
algorithm, for example, democratically lets each vertex vote for
the communities it sees surrounding it in its limited view of
the global system using a label propagation algorithm, and then
merges the local communities into a global collection. Other
approaches such as Link Community (LC) partition the graph by
first building a hierarchical link dendrogram according to link
similarity and then cutting the dendrogram at some threshold to
yield link communities.

² Our demo code is publicly available at:
https://github.com/yixuanli/lemon

Random walk based detection algorithms. As noted in the
preceding section, among the divergent approaches, random walks
tend to reveal communities that bear the closest resemblance to
the ground truth communities. In the following, we briefly
review some methods that have adopted the random walk technique
in finding communities. Among methods that focus on the global
structure, Pons et al. [21] proposed a hierarchical
agglomerative algorithm, WalkTrap, that quantified the
similarity between vertices using random walks and then
partitioned the network into non-overlapping communities. Meila
et al. [18] presented a clustering approach that views the
pairwise similarities as edge flows in a random walk and studied
the eigenvectors and eigenvalues of the resulting transition
matrix. A later successful algorithm, Infomap, proposed by
Rosvall & Bergstrom, enabled uncovering hierarchical structures
in networks by compressing a description of a random walker as a
proxy for real flow on networks. Variants of this technique,
such as the biased random walk [28], have also been employed in
community finding.

Domain          Dataset    Vertices     Links         Average      Maximum      Community
                                                      membership   membership   size mean
Product         Amazon     334,863      925,872       0.11         49           39
Collaboration   DBLP       317,080      1,049,866     0.22         11           251
Social          YouTube    1,134,890    2,987,624     0.05         41           79
Social          Orkut      3,072,441    117,185,083   9.56         504          83

Table 1: Statistics for the real networks.

Seed set expansion based approaches. In interpreting the problem
of community detection from a local perspective, our work is in
the same spirit as the seed set expansion algorithms in prior
work, including [14] and [25]. Specifically, Andersen & Lang
adapted the theoretical results from [24] to expand a set into a
community with locally minimal conductance based on lazy random
walks. However, the lazy random walk suffers from a much slower
mixing speed and usually takes more than 500 steps before
converging to a local structure, which is inefficient compared
with the several steps of rapid mixing in a regular random walk.
Focusing on seeding strategies, Whang et al. [25] established
several sophisticated methods for choosing the seed set, and
then used a similar PageRank scheme to expand the seeds until a
community with the optimal conductance is found. Nonetheless,
the performance gained by adopting these intricate seeding
methods was not significantly better than that obtained using
random seeds. This implies that a better scheme for expanding
the seeds is also needed, aside from a good seeding strategy. A
recent work by Kloumann & Kleinberg provided a systematic
understanding of variants of PageRank-based seed set expansion
and reported many insightful findings regarding seed set
heuristics. However, the lack of a proper stop criterion has
limited its functionality in practice. Even though a recently
proposed heat kernel algorithm advances PageRank by introducing
a sophisticated diffusion method, the detection accuracy
achieved by the heat kernel approach is still much lower than
that of LEMON, as shown in Section 6.2.


3. PRELIMINARIES

3.1 Problem Statement

Given a network $G = (V, E)$ and a set of members $S$ in the
target community $C$, where $|C| \ll |V|$ and $|S| \ll |C|$, we
are interested in discovering the remaining members in $C$.
Generally speaking, we focus on addressing the question of

how to accurately find a small community in time
that depends on the size of the community from a seed
set?

3.2 Datasets

3.2.1 Synthetic datasets

The LFR benchmark graphs [15] have been widely adopted for the
purpose of evaluating the performance of community detection
algorithms. LFR datasets are generated with built-in community
structure that resembles the features found in most real-world
networks with power-law degree distribution. The benchmark
provides researchers with rich flexibility to control the
network topology through tuning different parameters, including
the graph size $n$, the average degree $k$, the maximum degree
$k_{\max}$, the minimum and maximum community size $|C|_{\min}$
and $|C|_{\max}$, the mixing parameter $\mu$, the overlapping
membership $om$ and the number of vertices with overlapping
membership $on$. Among these parameters, the mixing parameter
$\mu$ has the most significant impact on the network topology:
it controls the fraction of links for each vertex that cross to
a community with which the vertex is not associated. Usually, a
larger $\mu$ results in lower detection accuracy.

Xie et al. have performed a thorough performance comparison of
different state-of-the-art overlapping community detection
algorithms on LFR benchmark datasets. To make the performance
evaluation of our algorithm consistent with theirs, we adopt the
same parameters in our paper. In total, we generate two sets of
networks with mixing parameter $\mu = 0.1$ and $\mu = 0.3$
respectively. We vary the parameter $om$ from 2 to 8 for each
$\mu$ and obtain a total of 14 networks. Table 2 lists the
values of the parameters we have used for generating the LFR
datasets.

Parameter      Description                         Value
$n$            graph size                          5000
$\mu$          mixing parameter                    {0.1, 0.3}
$k$            average degree                      10
$k_{\max}$     maximum degree                      50
$|C|_{\min}$   minimum community size              20
$|C|_{\max}$   maximum community size              100
$\tau_1$       node degree distribution exp.       2
$\tau_2$       community size distribution exp.    1
$om$           overlapping membership              {2, 3, ..., 8}
$on$           overlapping node                    2500

Table 2: Parameters for the LFR datasets.


3.2.2 Real datasets 

For the purpose of testing on real networks, we include four
datasets with ground truth community membership from the
Stanford Network Analysis Project. These datasets span various
domains of network applications, including product networks
(Amazon), collaboration networks (DBLP), and online social
networks (YouTube and Orkut)³. Each network can be viewed as an
undirected, connected graph. The statistical information of the
datasets is summarized in Table 1.

³ For all four real datasets, we adopt the top 5000 communities
that possess the highest quality according to [27].













3.3 Evaluation Metric 

For the evaluation metric, we adopt the F1 score to quantify the
similarity between the algorithmic community $C$ and the ground
truth community $C^*$. The F1 score for each pair $(C, C^*)$ is
defined by:

$$F_1(C, C^*) = \frac{2 \cdot Precision(C, C^*) \cdot Recall(C, C^*)}{Precision(C, C^*) + Recall(C, C^*)}, \tag{1}$$

where the precision and recall are defined as:

$$Precision(C, C^*) = \frac{|C \cap C^*|}{|C|}, \tag{2}$$

$$Recall(C, C^*) = \frac{|C \cap C^*|}{|C^*|}. \tag{3}$$

Throughout the paper, unless otherwise pointed out, the
experimental results on synthetic data for each instance are
given by the statistical mean and standard deviation based on 24
test cases⁴, and the experimental results on real datasets for
each instance are based on 120 test cases. All the ground truth
communities for testing are randomly chosen, which helps guard
against potential sampling bias in our tests.
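
As a small illustration of the metric above, the following sketch computes precision, recall and the F1 score for two vertex sets. The function name `f1_score` and the convention of returning 0 when the sets do not overlap are assumptions made for this example, not part of the paper.

```python
def f1_score(detected, truth):
    """F1 score between a detected community C and a ground truth community C*.

    Both arguments are iterables of vertex identifiers. Returning 0.0 for
    disjoint sets is a convention assumed here to avoid division by zero.
    """
    detected, truth = set(detected), set(truth)
    overlap = len(detected & truth)          # |C ∩ C*|
    if overlap == 0:
        return 0.0
    precision = overlap / len(detected)      # |C ∩ C*| / |C|
    recall = overlap / len(truth)            # |C ∩ C*| / |C*|
    return 2 * precision * recall / (precision + recall)
```

For instance, a detected set of four vertices that shares two vertices with a ground truth community of six has precision 1/2 and recall 1/3.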

4. LOCAL EXPANSION VIA MINIMIZING 
ONE NORM 

4.1 Algorithm Overview 

Spectral clustering makes use of a small number of singular
vectors proportional to the number of communities in the
network. If a graph has thousands of small communities, it is
impractical to calculate a number of singular vectors greater
than the number of communities. We are experimenting with a
fundamentally new technique, which does not require the
burdensome computation of a large number of singular vectors.
Before explaining our local spectral approach for finding
overlapping communities, it is necessary to make clear what we
mean by “local spectra”.

In traditional spectral clustering methods, one finds the first
few singular vectors of the Laplacian matrix⁵ of a graph G with
n vertices. Suppose the first d singular vectors are obtained;
one can then form an n x d matrix as a latent space. One
associates with each vertex a point in this latent space whose
coordinates are given by the entries of the corresponding row in
the matrix. Vertices are clustered using some method such as the
k-means clustering algorithm. This method is not likely to work
well if the communities are small and heavily overlapping with
each other.

We make two fundamental changes to this method. The first
modification is to overcome the drawback of computing the
singular vectors. Intuitively, the vertices around the seed
members are more likely to be in the target community, thus a
random walk serves as a natural subroutine to reveal these
potential members. We start a random walk from several known
members in the target community and run it for a few steps. The
number of random walk steps should be large enough to reach out
to the vertices in the target community, but not so large that
the walk spreads out to the entire graph. Instead of considering
a single probability vector, we consider the span of a few
dimensions of vectors after the short random walks and use it as
the approximate invariant subspace (local spectra). The second
modification is to handle the overlapping situation. Instead of
using k-means to partition the points in the latent space into
disjoint clusters, we look for the minimum 0-norm vector in the
span of the invariant subspace obtained above, such that the
seed members are in its support. We want to find rows in the
invariant subspace that point in nearly the same direction as
the seed members. We use the minimum 1-norm vector as a proxy
for the minimum 0-norm vector, since finding the latter is an
NP-hard problem.

⁴ Each local expansion process from a seed set can be viewed as
a test case.

⁵ In the literature, several different definitions of the graph
Laplacian exist. Readers can refer to introductory material on
spectral clustering for more details.

In the following, we give a formal description of our local
spectral approach LEMON for detecting the target communities
from a small seed set. Given as input a set of a few vertices S
that are already known to be in the target ground truth
community C*, our algorithm outputs the algorithmic community C
such that the F1 measure for scoring the similarity between C
and C* is maximized.

Step 1. Generate the local spectra

Let $A$ be the normalized adjacency matrix of the graph,
normalized by the diagonal matrix $D$ of vertex degrees.
Consider a random walk starting from the exemplary vertices in
$S$. Let $p_0$ denote the initial probability vector, where the
total probability is evenly distributed among the seed members.
Consider the $l$-dimensional span of the probability vectors
obtained in $l$ successive random walk steps,

$$P_{0,l} = [p_0, p_1, \ldots, p_l]. \tag{4}$$

The initial invariant subspace is then obtained by calculating
an orthonormal basis of the span of $P_{0,l}$, which we denote
by $V_{0,l}$. We then use the following recurrence to
iteratively calculate the $l$-dimensional orthonormal basis
$V_{k,l}$ after $k$ steps of random walk:

$$V_{k,l} R_{k,l} = V_{k-1,l} A, \tag{5}$$

where the matrix $R_{k,l}$ is chosen such that $V_{k,l}$ is
orthonormal. The orthonormal basis $V_{k,l}$ will be used as the
local spectra for clustering.
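
Step 1 can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the graph is a dense symmetric adjacency matrix, the normalization $D^{-1/2}(A + I)D^{-1/2}$ with self-loops is an assumed choice (the exact formula is not given in this text), the basis is built from exactly `l` walk vectors stored as columns, and the orthonormalization uses a QR factorization whose triangular factor plays the role of $R_{k,l}$. The name `local_spectra` is hypothetical.

```python
import numpy as np

def local_spectra(adj, seeds, l=3, k=3):
    """Return an n x l orthonormal basis spanning a short-random-walk subspace."""
    n = adj.shape[0]
    # Assumed normalization: D^{-1/2} (A + I) D^{-1/2}, with D the degrees of A + I.
    a = adj + np.eye(n)
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    a_norm = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # p_0: total probability spread evenly over the seed members.
    p = np.zeros(n)
    p[list(seeds)] = 1.0 / len(seeds)

    # l successive walk vectors, stored as the columns of an n x l matrix.
    walks = [p]
    for _ in range(l - 1):
        walks.append(a_norm @ walks[-1])
    basis, _ = np.linalg.qr(np.column_stack(walks))   # initial orthonormal basis

    # k further walk steps, re-orthonormalizing after each one.
    # The discarded triangular factor corresponds to R in the recurrence.
    for _ in range(k):
        basis, _ = np.linalg.qr(a_norm @ basis)
    return basis
```

Because the assumed normalized matrix is symmetric, multiplying column vectors from the left is equivalent, up to transposition, to the row-vector form of the recurrence above.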

Step 2. Seek a sparse vector

With the local spectra $V_{k,l}$, we then solve the following
linear programming problem

$$\begin{aligned}
\min\ & e^T y = \|y\|_1 \\
\text{s.t.}\ & y = V_{k,l}\, x, \\
& y \ge 0, \\
& y(S) \ge 1,
\end{aligned}$$

where $e$ is a vector of all ones, and both $x$ and $y$ are
unknown vectors. The first constraint indicates that $y$ is in
the space spanned by $V_{k,l}$. The elements of $y$ indicate the
likelihood that the corresponding vertices belong to the target
community, and are therefore required to be non-negative by the
second constraint. The third constraint enforces that the seeds
are in the support of the sparse vector $y$.

After sorting the elements of $y$ in non-ascending order, the
vertices corresponding to the top $|C|$ elements are returned as
the detected community with respect to the seed set $S$.
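
The linear program of Step 2 can be handed to a generic LP solver. The sketch below uses `scipy.optimize.linprog` and eliminates $y$ by substituting $y = Vx$, so that the only unknown is the low-dimensional vector $x$; the function name `sparse_indicator` and this particular reformulation are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def sparse_indicator(basis, seeds):
    """Solve  min e^T y  s.t.  y = basis @ x,  y >= 0,  y[seeds] >= 1.

    basis is the n x l local spectra matrix; the returned vector y has one
    non-negative score per vertex.
    """
    n, l = basis.shape
    seeds = list(seeds)
    cost = basis.sum(axis=0)                       # e^T (V x) = (V^T e)^T x
    # Inequalities in the form A_ub @ x <= b_ub:
    #   -V x <= 0        (y >= 0)
    #   -V[seeds] x <= -1  (y(S) >= 1)
    a_ub = np.vstack([-basis, -basis[seeds]])
    b_ub = np.concatenate([np.zeros(n), -np.ones(len(seeds))])
    result = linprog(cost, A_ub=a_ub, b_ub=b_ub,
                     bounds=[(None, None)] * l,    # x is unconstrained in sign
                     method="highs")
    if not result.success:
        raise ValueError("linear program failed: " + result.message)
    return basis @ result.x
```

A candidate community of a given size can then be read off with `np.argsort(-y)[:size]`.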
Figure 1: The average F1 score on the Amazon network with varying dimension l and random walk step k,
respectively. The plots depict the statistical regression line with a 95% confidence interval.


Step 3. Reseeding

Augment the initial seed set by adding the vertices
corresponding to the top t elements of y. Denote the augmented
seed set as S'. Then repeat step 1 and step 2 using the
augmented seed set S'. The detection accuracy can be improved
through iterations by increasing t by a constant number s each
time. We define s to be the seed expansion step, which is used
as a tunable parameter for adjusting the convergence rate.
Usually, a larger expansion step results in lower performance
but a faster running speed with fewer iterations. In the
experiments, we fix the seed expansion step to be 6 for both
synthetic and real datasets. The number of iterations for the
seed expansion is determined by the stop criteria (Section 4.4).
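
The reseeding loop can be sketched independently of how the scores are produced. In the sketch below, `score_vertices` is a hypothetical callback standing in for steps 1 and 2 (it maps a seed list to one score per vertex), and a fixed `iterations` count is an assumed placeholder for the stop criteria.

```python
def expand_with_reseeding(seeds, score_vertices, step=6, iterations=3):
    """Grow the seed set by the top-t scored vertices, with t increased by `step`.

    score_vertices(seed_list) -> sequence of scores, one per vertex.
    Returns the final augmented seed set and the scores computed from it.
    """
    initial = list(seeds)
    current = list(initial)
    scores = score_vertices(current)
    t = 0
    for _ in range(iterations):
        t += step                                    # t grows by s each round
        ranked = sorted(range(len(scores)), key=lambda v: -scores[v])
        top = ranked[:t]
        # Augmented seed set S': initial seeds plus the top t vertices.
        current = initial + [v for v in top if v not in initial]
        scores = score_vertices(current)             # repeat steps 1 and 2
    return current, scores
```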

4.2 Parameter Selection 

The random walk step k and subspace dimension l are the key
parameters in the local spectral clustering algorithm. We
conduct a parameter sensitivity study for these two parameters
on the four real datasets.

4.2.1 Subspace dimension 

To study the subspace dimension l, we fix the random walk step
to be 3 and vary the dimension l from 1 to 15. Figure 1 (left
panel) shows that changing the dimension l does not cause
significant performance fluctuation. On one hand, choosing a
large dimension l is undesirable because it increases the
computation cost of generating the local spectra. On the other
hand, when the dimension degrades to l = 1, the standard
deviation of the F1 score becomes significant, making the
detection accuracy unstable. In this paper, we fix l = 3 because
the experiment suggests that this setting statistically achieves
both high and stable performance. Note that this observation
holds not only for the Amazon network, but for the remaining
real datasets as well.

4.2.2 Random walk step

To investigate how the random walk step affects the algorithm
performance, we fix the dimension l to be 3 and vary the random
walk step k from 1 to 15. Figure 1 (right panel) shows that the
average F1 score plateaus as k increases, and a 3-step random
walk can yield the algorithm’s full potential. The standard
deviation, however, significantly increases when k exceeds 10.
This indicates that a longer random walk is undesirable for
stably uncovering the local community structure. Throughout the
paper, we fix the random walk step k = 3 for the real datasets⁶.

4.3 Complexity Reduction by Sampling Method 

If one wants to uncover a small community within a large network
with billions of vertices, it would be very costly to take all
the vertices into account. We want to discover the target
community accurately while keeping the number of vertices
examined small. A sampling method can effectively address the
memory consumption issue when one wants to find a local
community within a large graph, since the whole graph does not
have to be loaded into memory.

In practice, the unknown members in the target community are
more likely to be around the seed members, and are usually a few
steps away from the seeds. This observation motivates us to
reduce the complexity by taking only a portion of the graph into
consideration. Ideally, this partial graph should contain as
many vertices in the target community as possible, and maintain
a small size of the same scale as that of the target community.

To sample the graph, we expand the seed set using a random walk.
After a few steps of the random walk, vertices with large
probability are more likely to be in the target community, while
vertices with a small probability of being reached are treated
as redundant ones. If the target community exists for the seed
set, then this target community serves as a bottleneck that
keeps the probability from spreading out. It is worth noting
that other expansion methods such as breadth-first search (BFS)
would entirely ignore the bottleneck defining the community and
rapidly mix with the entire graph before a significant fraction
of vertices in the community have been reached. The subgraph
returned by BFS usually contains fewer vertices in the target
community than a subgraph of the same size obtained by the
random walk technique.

Figure 2: Comparison of the average F1 score with ground truth and automatic size determination. The left
corresponds to the LFR datasets when $\mu = 0.3$ and the right corresponds to the real datasets.

In the experiments on real datasets, we conduct a random walk
starting from the seed set until the probability has been spread
out to $a \cdot |C|_{avg}$ vertices, where $a$ is some constant
and $|C|_{avg}$ is the average community size in the graph⁷.
Note that $a \cdot |C|_{avg}$ should be large enough to cover as
many vertices in the ground truth community as possible. This
newly obtained subgraph will be used for the remaining
computation.

⁶ For LFR benchmark graphs, we adopt altogether 6 combinations
for the (step, dimension) tuple: (2, 3), (2, 4), (2, 5), (3, 3),
(3, 4), (3, 5), and return the highest F1 score among these
combinations.

⁷ A fast implementation for updating the probability vector of
the random walks is described in detail in Section 4 of the
referenced work.
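
A minimal sketch of random-walk sampling is given below. It makes several assumptions that are not taken from the paper: the graph is a dense adjacency matrix without isolated vertices, probability mass is accumulated over the steps so that every vertex visited so far counts as reached, the seeds are always kept, and `target_size` stands for the product of the constant and the average community size. The name `sample_subgraph` is hypothetical.

```python
import numpy as np

def sample_subgraph(adj, seeds, target_size, max_steps=10):
    """Return the indices of up to target_size vertices reached by a short walk."""
    seeds = list(seeds)
    degree = adj.sum(axis=1)
    transition = adj / degree[:, None]     # row-stochastic walk matrix
    p = np.zeros(adj.shape[0])
    p[seeds] = 1.0 / len(seeds)            # start evenly on the seeds
    mass = p.copy()                        # accumulated probability (assumption)
    for _ in range(max_steps):
        if np.count_nonzero(mass) >= target_size:
            break
        p = p @ transition                 # one random walk step
        mass += p
    mass[seeds] = np.inf                   # always keep the seeds
    reached = np.flatnonzero(mass)
    order = np.argsort(-mass[reached], kind="stable")
    return np.sort(reached[order][:target_size])
```

The induced subgraph `adj[np.ix_(kept, kept)]` on the returned indices `kept` can then be passed to the later steps in place of the full graph.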

Table 3 gives the statistics after applying the sampling method
to the real networks. For example, in the DBLP network, setting
a to be around 10 yields a subgraph containing on average 98% of
the vertices in the ground truth community. After sampling, we
only need to deal with a subgraph of size around 2400 instead of
317,080, bringing a significant reduction of both temporal and
spatial complexity.


Dataset    Coverage ratio   Sample rate   $|C|_{avg}$   Subgraph size
Amazon     1.00             0.0087        39            2913
DBLP       0.98             0.0076        251           2409
YouTube    0.66             0.0033        79            3745
Orkut      0.64             0.0011        83            3379

Table 3: Statistics of the mean values for the sampling method on real datasets.


4.4 Stop Criteria 

If ground truth communities are available, the above algorithm
is guaranteed to stop within a few iterations, since the seed
set is no longer augmented once its size exceeds that of the
ground truth community. The algorithm then returns the community
found with the highest F1 score during the iterations as the
result. However, in real cases, without knowing the exact size
of the communities, most locally based detection algorithms have
difficulty deciding when to terminate the expansion such that
the discovered community is a “good” community. It is thus
important to solve two issues: 1) how to automatically determine
the size of the community given a seed set S, and 2) when to
stop growing the seed set during the reseeding process.

4.4.1 Determine the size of the community 

It has already been shown that random walks produce communities
with conductance guarantees and ensure a small boundary defining
a natural community in locally based detection algorithms. The
intuition is that adding irrelevant vertices to the target
community would inevitably cause the conductance to increase,
and finding a low-conductance community could ensure the
closeness between the detected members and the known seed set.
In [25], the authors also follow the same idea and adopt
conductance as the metric for defining a good community found by
the algorithm around a seed set. As we will see, the local
conductance for a small group of vertices in the graph contains
valuable information and enables designing effective stop
criteria for our algorithm.

The conductance of a set of vertices $C$ is defined as

$$\Phi(C) = \frac{|\partial(C)|}{\min(\mathrm{Vol}(C), \mathrm{Vol}(\bar{C}))}, \tag{6}$$

where $|\partial(C)|$ denotes the cut size, $\mathrm{Vol}(C)$ is
the sum of vertex degrees in the set $C$, and $\bar{C}$ is the
complement of $C$.
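
The conductance of a vertex set can be computed directly from an adjacency matrix, as in the following sketch. The dense NumPy representation, the function name `conductance`, and the convention of returning 1.0 when one side has zero volume are assumptions of this example.

```python
import numpy as np

def conductance(adj, members):
    """Cut size divided by the smaller of the two side volumes."""
    n = adj.shape[0]
    inside = np.zeros(n, dtype=bool)
    inside[list(members)] = True
    cut = adj[inside][:, ~inside].sum()      # edges leaving the set
    degree = adj.sum(axis=1)
    vol_in = degree[inside].sum()            # volume of the set
    vol_out = degree[~inside].sum()          # volume of its complement
    denom = min(vol_in, vol_out)
    # Assumed convention: an empty side gives the worst possible value.
    return float(cut / denom) if denom > 0 else 1.0
```

For two triangles joined by a single edge, taking one triangle as the set gives a cut of 1 and a volume of 7 on each side, so the conductance is 1/7.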

Suppose we have a rough estimate of the lower and upper bounds
for the size of communities in a graph, which we denote by
$|C|_{\min}$ and $|C|_{\max}$ respectively. We can modify the
original algorithm in the following way.

At step 2, after obtaining the sorted sparse vector y, we would
like to truncate the sorted vector at some cutoff value such
that all the vertices whose elements are no less than the cutoff
are included in the algorithmic community. The crux is that we
do not yet know the best position at which to truncate the
vector y. To solve this issue, we denote by $A_i$ the set of
vertices corresponding to the top i elements in y. We then sweep
over the sets from $A_{|C|_{\min}}$ to $A_{|C|_{\max}}$ and
calculate the corresponding conductance for each of the sets. In
practice, the conductance as a function of the set size usually
changes in a non-monotonic pattern that decreases first and then
increases later on. We take the set size at which the first
local minimum of conductance is encountered on this curve as the
estimated size of the community with respect to the seed set S,
and record the conductance value at that minimum.
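
The sweep can be sketched as a scan for the first local minimum over a precomputed list of conductance values, one per candidate size. The function name `first_local_minimum_size` and the fallback to the global minimum when the curve has no interior local minimum are assumptions for illustration.

```python
def first_local_minimum_size(conductances, min_size):
    """Return (size, conductance) at the first local minimum of the sweep.

    conductances[i] is the conductance of the top (min_size + i) ranked
    vertices, for sizes from the lower to the upper bound.
    """
    for i in range(1, len(conductances) - 1):
        went_down = conductances[i] <= conductances[i - 1]
        goes_up = conductances[i] < conductances[i + 1]
        if went_down and goes_up:
            return min_size + i, conductances[i]
    # Assumed fallback: no interior local minimum, so use the global minimum.
    best = min(range(len(conductances)), key=conductances.__getitem__)
    return min_size + best, conductances[best]
```

Scanning from the smallest size means a later, deeper dip in the curve is deliberately ignored, which matches taking the first relative minimum rather than the global one.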


4.4.2 Stop the reseeding process 
As we keep augmenting the seed set through reseeding at step 3,
a different seed set results in a different sparse vector y and
thus leads to potentially different algorithmic communities. In
practice, one of the seed sets produced during the augmenting
process achieves the highest F1 score. It remains to address the
issue of when to stop growing the seed set so that the algorithm
finds the community that most closely resembles the ground truth
community. This issue can be solved in a similar fashion to that
of determining the community size.
Specifically, we keep track of the first-local-minimum
conductance (Section 4.4.1) for each seed set during the
expansion, and stop growing the seed set when this value reaches
a local minimum and starts to increase for the first time.

Figure 3: The average F1 score on LFR datasets ($\mu = 0.1$) and real datasets with different seeding methods.

4.4.3 Automatically detected size vs. ground truth size

To verify our method, we compare the performance after 
applying the stop criteria with that obtained using ground 
truth communities. Figure shows the statistical result 
of F1 score on both synthetic and real datasets. On both
datasets, the F1 score with automatic size determination is
only 10% lower on average than the performance obtained
with the ground truth size available. This implies that our
method is applicable for finding communities that mostly 
resemble the ground truth communities on both synthetic 
and real datasets in different domains. It also suggests that 
our method can be applied in practice to uncover natural 
communities in the situation when no ground truth is avail¬ 
able. 

5. SEEDING 

Since the initial seed set serves as a key component in
our algorithm for uncovering the target community C, it is
crucial to consider how the quality of the seed set affects
the performance. In practice, there is not much control over
how the seeds are selected. However, the alternative seeding
methods we provide here can be strategically applied by
domain experts based on the availability of candidate seeds
in different scenarios. In this section, we focus on two
fundamentally important issues regarding the seed set:
1) What defines “good” seeds? and 2) How many seeds are
needed in order to uniquely define a community?

5.1 Seeding Method 

To give a well-rounded evaluation, we consider five
different seeding methods in total. In this experiment, we
adopt |S| = 3 seeds for each of the seeding methods listed
below.

• High degree seeding: pick |S| vertices with degree
ranked in the top one third among the degree of all
vertices in C.

• Low degree seeding: pick |S| vertices with degree
ranked in the bottom one third among the degree of
all vertices in C.

• Triangle seeding: pick |S| vertices in C that form a
triangle as the initial seed set.

• Random seeding: pick |S| vertices in C randomly.

• High inward-edge ratio seeding: the inward-edge
ratio for a vertex v is defined by the fraction of links
connecting to another vertex inside the target
community C among all the links coming out from v. We pick
|S| vertices with inward-edge ratio ranked in the top
one third among all vertices in C.
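The five seeding strategies can be sketched in a few lines. This is an illustrative reading of the list above, not the authors' code: the graph is assumed to be an adjacency map of sets, ties in the rankings are broken by vertex id, the triangle variant always returns three vertices, and the names `build_adj`, `inward_ratio` and `pick_seeds` are hypothetical.

```python
import random
from itertools import combinations


def build_adj(edges):
    """Build an undirected adjacency map {vertex: set(neighbours)}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def inward_ratio(adj, community, v):
    """Fraction of the links of v that end inside the community."""
    return sum(1 for u in adj[v] if u in community) / len(adj[v])


def pick_seeds(adj, community, method, n_seeds=3, rng=None):
    """Pick seeds from `community` using one of the five strategies."""
    rng = rng or random.Random(0)
    members = sorted(community)
    k = max(1, len(members) // 3)  # size of the "one third" pool
    if method == "random":
        return rng.sample(members, n_seeds)
    if method == "triangle":
        triangles = [t for t in combinations(members, 3)
                     if t[1] in adj[t[0]] and t[2] in adj[t[0]]
                     and t[2] in adj[t[1]]]
        return list(rng.choice(triangles)) if triangles else []
    if method == "high_degree":
        pool = sorted(members, key=lambda v: len(adj[v]), reverse=True)[:k]
    elif method == "low_degree":
        pool = sorted(members, key=lambda v: len(adj[v]))[:k]
    elif method == "high_inward_ratio":
        pool = sorted(members,
                      key=lambda v: inward_ratio(adj, community, v),
                      reverse=True)[:k]
    else:
        raise ValueError(f"unknown seeding method: {method}")
    return rng.sample(pool, min(n_seeds, len(pool)))
```

Enumerating all vertex triples is only practical for small communities (around a hundred members, as in the experiments here); a larger community would call for a neighbour-based triangle search.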

Figure 3 (left panel) gives the experimental results on LFR
benchmark datasets with mixing parameter μ = 0.1. It
is interesting to note that the high-degree seeding method
consistently achieves the highest F1 score in both groups of
datasets. When μ = 0.1, triangle seeding leads to the worst
performance, with a low F1 score and a high standard
deviation. This implies that seeding from a compact core
structure is less advantageous than seeding sporadically
among vertices. The intuitive explanation is that it is more
difficult for the probabilities to spread out when the
random walk starts from a cohesive structure.

Another interesting observation is that high inward-edge 
ratio seeding method can consistently lead to the best per¬ 
formance among different seeding methods on both synthetic 
and real datasets. In [^, the authors have the same obser¬ 
vation as ours but did not give an explicit explanation on 
this phenomenon. In fact, when a large fraction of the seeds'
links connect to vertices within the same community, random
walks starting from these seeds are more likely to keep
their probability on vertices within the community rather
than spreading it to vertices outside the community. A
higher detection accuracy can thus be achieved since the
target community contains much of the probability after
short random walks.
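The effect can be checked on a toy graph. The sketch below assumes a plain random walk in which each vertex splits its probability evenly among its neighbours; the walk actually used by the algorithm may be defined differently, and the helper names are hypothetical.

```python
def two_cliques():
    """Toy graph: two 4-cliques {0..3} and {4..7} joined by the edge 3-4."""
    adj = {v: set() for v in range(8)}
    for block in (range(0, 4), range(4, 8)):
        for a in block:
            for b in block:
                if a != b:
                    adj[a].add(b)
    adj[3].add(4)
    adj[4].add(3)
    return adj


def walk_step(adj, p):
    """One walk step: every vertex passes an equal share to each neighbour."""
    nxt = {}
    for v, mass in p.items():
        share = mass / len(adj[v])
        for u in adj[v]:
            nxt[u] = nxt.get(u, 0.0) + share
    return nxt


def mass_in_community(adj, seeds, community, steps=3):
    """Probability left inside `community` after a short walk from `seeds`."""
    p = {s: 1.0 / len(seeds) for s in seeds}
    for _ in range(steps):
        p = walk_step(adj, p)
    return sum(m for v, m in p.items() if v in community)


adj = two_cliques()
community = {0, 1, 2, 3}
# Seeds 0, 1, 2 have inward-edge ratio 1; vertex 3 has ratio 3/4.
print(mass_in_community(adj, [0, 1, 2], community))
print(mass_in_community(adj, [1, 2, 3], community))
```

The seed set that avoids the boundary vertex retains more probability inside the community after three steps, which matches the explanation given in the text.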

Moreover, it is also striking to note the difference
between the test results on synthetic datasets and those on
real datasets. Although the high-degree seeding method
always outperforms random seeding on synthetic datasets,
the behavior of these seeding methods on real networks is
quite different. In Figure 3 (right panel), we see that
low-degree seeds lead to better results than high-degree
seeds on the DBLP and YouTube datasets. The degree of seeds
does not have a significant impact on the performance in the
Amazon and Orkut networks
since the performance of high-degree seeding and low-degree
seeding almost tie with each other on these datasets. In , 
the authors compared the detection accuracy of PageRank 
based seed set expansion algorithm with high-degree seeding 
and random seeding on real networks, and concluded that 
random seeding method always outperforms high-degree seed¬ 
ing in all domains of real networks. However, we remark here 
that this observation does not apply to our algorithm as we 
find that high-degree seeding works slightly better than ran¬ 
dom seeding on Orkut and YouTube datasets. 

5.2 Seed Set Size 

It is also interesting to investigate how the size of the seed 
set affects the performance of our algorithm. 

We first experiment on the LFR benchmark datasets with
varying seed set size. We choose a seed set of size
proportional to the size of the target community C.
Specifically, we test with five different seeding ratios r:
2%, 4%, 6%, 8% and 10%, and round r · |C| to an integer if
it is a fraction. Figure 4 shows the F1 scores when μ = 0.1.
The algorithm's performance generally improves as the seed
set size increases. When both the mixing parameter and the
overlapping membership are small, e.g., μ = 0.1, om = 2,
increasing the seed set size does not seem to affect the
performance significantly, and a seed set consisting of a
small percentage of vertices is sufficient to discover the
target community with high accuracy. This implies that
when the structure of a small community is well-defined, our
algorithm only needs 2 to 3 seeds to reveal the remaining
members in a community of size roughly 100. For general
test purposes on LFR benchmark graphs, we adopt an 8%
fraction of the vertices in the target community as seeds.

Figure 4: The average F1 score on LFR datasets
(μ = 0.1) with different seeding ratio.
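As a small worked example of the seeding ratios, the seed count for a community can be computed as below. The text does not say which rounding rule is used, so rounding half up and keeping at least one seed are assumptions made here; `seed_count` is a hypothetical helper.

```python
import math


def seed_count(community_size, ratio):
    """Seeds for a given seeding ratio: round half up, at least one seed."""
    return max(1, math.floor(ratio * community_size + 0.5))


for ratio in (0.02, 0.04, 0.06, 0.08, 0.10):
    print(ratio, seed_count(100, ratio))
```

For a community of about 100 vertices, a 2% ratio gives 2 seeds, which is consistent with the remark that 2 to 3 seeds suffice when the community structure is well-defined.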


We then carry out a similar experiment on the real
datasets. The result on real networks is interesting because
increasing the seed set size has little effect on the
performance. In particular, using only 3 seeds yields almost
the same performance as using an 8% fraction of the vertices
in the target community as seeds on real datasets.

Our algorithm thus has an advantage over many other seed set
expansion algorithms that usually require a higher fraction
of vertices to be known. For example, in [^, the authors 
perform a similar experiment on the DBLP network. Their
algorithm achieves a maximum recall of 0.3 when the seeding
ratio is 10%, while LEMON achieves an average F1 score of
0.66 with 3 vertices. This makes our algorithm practical
for real networks when it is impossible to collect a large
number of seeds.


5.3 Further Extension 


As the results of using different seeding methods suggest,
high-degree seeds can heuristically lead to better results
on synthetic data. This heuristic implies that a vertex with
higher degree may exert a stronger influence on shaping the
subspace we are looking for, and thus affect the performance
by leading to different sparse vectors from which we obtain
the “candidates” of the target community.

In practice, we usually have little control over the seed
set. The chance of getting a seed set of high-degree members
is small. More often than not, the degrees of the seeds are
randomly distributed. We are therefore inspired to tailor
our algorithm to emphasize the seeds with high degree. The
modification is rather straightforward: when calculating
the initial probability vector p_0 from which the random
walk starts, instead of distributing the probability evenly
among the seeds, we initialize the probability vector
according to the degree of each seed. Formally,


\[
p_0(v_i) =
\begin{cases}
d(v_i)/\mathrm{Vol}(S) & \text{if } v_i \in S \\
0 & \text{otherwise}
\end{cases}
\tag{7}
\]


where d(v_i) denotes the degree of vertex v_i. In other
words, we enforce a bias towards the high-degree vertices at
the beginning of the random walk. Note that each time after
the reseeding process, the initial probability vector also
needs to be recalculated in the same way.
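The two initializations can be compared with a short sketch. It assumes that Vol(S) is the sum of the degrees of the seeds, which is what makes the entries of Equation (7) sum to one, and it stores only the non-zero entries of the vector. The function names are hypothetical.

```python
def uniform_start(seeds):
    """Original initialization: equal probability on every seed."""
    return {s: 1.0 / len(seeds) for s in seeds}


def degree_weighted_start(adj, seeds):
    """Equation (7): p0(v) = d(v) / Vol(S) for seeds, zero elsewhere.

    Vol(S) is taken to be the sum of the seed degrees (an assumption).
    Vertices missing from the returned map have probability zero.
    """
    vol = sum(len(adj[s]) for s in seeds)
    return {s: len(adj[s]) / vol for s in seeds}


# Toy adjacency map: vertex 0 has degree 3, vertex 3 has degree 1.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(uniform_start([0, 3]))
print(degree_weighted_start(adj, [0, 3]))
```

After each reseeding round the same function would simply be called again on the augmented seed set.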

Figure 5 (left panel) depicts the experimental results on
LFR benchmark graphs with and without degree-normalized
initialization for the random walk. We find that degree
normalization of the initial probability vector results in
better performance.

We then perform the same experiments on real networks,
and find that degree normalization, on the contrary, leads
to slightly worse statistical results (see the right panel
of Figure 5). The completely different behavior of degree
normalization on real datasets is rather intriguing. In
fact, this phenomenon accords with our previous observation
in Section 5.1 that a high-degree seed set is less
advantageous than random seeds on real datasets. This
explains why emphasizing the high-degree vertices worsens
the performance on real datasets.


6. COMPARISON WITH THE STATE-OF-THE-ART ALGORITHMS

6.1 Baseline Algorithms 

To give a well-rounded performance comparison with
state-of-the-art algorithms, we compare our results to three
localized community detection algorithms and four global
community detection algorithms.

1. Localized algorithms: We include three locally
based methods: Heat Kernel (HK) [13], PageRank (PR)
[14] and Seed Set Expansion (SSE) [25].

2. Global algorithms: We also compare our local
spectral clustering algorithm with four overlapping
community detection methods that are based on the global
structure: OSLOM, DEMON, and LinkCommunity (LC).

OSLOM: http://www.oslom.org/software.htm
DEMON: http://www.michelecoscia.com/?page_id=42
LinkCommunity: https://github.com/bagrow/linkcomm




Figure 5: Comparison of the average F1 score with and without normalizing the initial probability vector by
each seed’s degree. The left corresponds to the LFR datasets and the right corresponds to the real datasets.
(Left panel: x-axis is overlapping membership, 2 to 8; series are mu=0.1 and mu=0.3, each normalized and unnormalized.)


6.2 Comparison with Localized Algorithms 

We refer to the experimental results reported in some
recent publications on localized community detection
algorithms [13, 14, 25]. Figure 6 illustrates the comparison
of F1 scores on Amazon, DBLP, YouTube and Orkut datasets.
We use “LEMON-auto” to denote the results obtained by
applying the stop criteria in Section 4.4. Since the results
on Orkut and YouTube datasets are missing in and , 
we use empty bars to indicate them. 

Figure 6: Comparison of the average F1 score with
state-of-the-art local detection algorithms on real
networks.


Figure 6 shows that LEMON achieves an F1 score of 0.910
on the Amazon dataset, far outperforming the other
algorithms. Its average F1 score is three times that of the
heat kernel algorithm [13] on the Amazon, DBLP and Orkut
networks. Compared with [25], the average F1 score of our
algorithm doubles their best performance, achieved by the
“spread hubs” method, on the Amazon dataset and triples it
on the DBLP network. Also, note that in [14] the authors
did not have an explicit stop criterion. Instead, they
assumed a budget for predicting the size of the target
community. We compare with the F1 score at a budget of
100 for both Amazon and DBLP datasets. From the results
on the Amazon network in [14], we notice that even with a
budget of 400, which is far beyond the average community
size of 39 in the Amazon network, only a recall of 0.45 can
be achieved. We infer that the F1 score would be even lower
than this value since the precision is dragged down by the
large budget.
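The inference about the F1 score can be made concrete with a rough calculation. The sketch below treats the average community size of 39 as if it were the size of a single community, which is a simplification made only for illustration; `f1_from_budget` is a hypothetical helper.

```python
def f1_from_budget(recall, community_size, budget):
    """F1 score when `budget` vertices are returned at the given recall."""
    true_positives = recall * community_size
    precision = true_positives / budget
    return 2 * precision * recall / (precision + recall)


# Recall 0.45 on a community of 39 vertices with a budget of 400.
print(round(f1_from_budget(0.45, 39, 400), 3))
```

Under these assumptions the precision is below 0.05, so the F1 score comes out at roughly 0.08, well under the recall of 0.45.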


It is also worth noting that we only use 3 randomly picked
seeds for all the test cases on each dataset. Our algorithm
requires far fewer seeds than other algorithms.

The experiment has verified that our algorithm is able to
achieve high accuracy on large networks consisting of
communities with an average size of roughly one hundred.
This implies that our approach is well-suited for the task
of detecting small communities in large networks.


6.3 Comparison with Global Algorithms 

We also compare local spectral clustering with several
state-of-the-art global algorithms. Table 4 summarizes
the running time as well as the average F1 score of each
algorithm on real datasets. Among the baselines, OSLOM
and LC fail to terminate within 10 days on the YouTube
dataset. The OSLOM algorithm can achieve rather good
performance but does not scale well.

In contrast, our algorithm consistently returns the result
within a few seconds irrespective of how large the entire
graph is. Besides, our algorithm has a small memory
footprint, and a machine with 4GB RAM can process networks
as large as Orkut since the algorithm does not have to
store the whole graph in memory. Moreover, our locally
based algorithm is parallelizable because each seed set
expansion can be computed independently. This property
can bring a further gain in running time with a
multi-threaded implementation.
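Because the expansions are independent, they can be dispatched to a pool of workers. The sketch below uses Python's standard `concurrent.futures` module; the callable `expand` is a hypothetical placeholder for a single seed set expansion and is not part of the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def expand_all(seed_sets, expand, workers=4):
    """Run `expand` on every seed set independently.

    Results are returned in the same order as `seed_sets`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(expand, seed_sets))


# Placeholder expansion: here it just sums the seed ids.
print(expand_all([[1, 2], [3], []], sum))
```

In CPython, threads mainly help when the expansion releases the interpreter lock (for example inside numerical libraries); for pure Python code, swapping in `ProcessPoolExecutor` from the same module is the usual alternative.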

Figure 7 compares the average F1 score with some
state-of-the-art algorithms on LFR benchmark graphs. In this
experiment, we also incorporate the methods addressed in
Section 5.3 that can effectively improve the performance on
synthetic datasets. We notice that our algorithm outperforms
the baseline algorithms even when we use the random seeding
strategy. When the mixing parameter μ = 0.3, as shown in
Figure 7, LEMON brings about a 30% to 40% relative
improvement compared with the best results among the
baselines. We can expect the performance gain to be even
more significant if the seeds possess the qualities
discussed in Section 5.1.

Among the four baselines, we notice that LC and DEMON
consistently perform poorly on both groups of the synthetic
datasets. We further look into the communities found by LC
and DEMON respectively, and find that LC tends to partition
the graphs into very small pieces while DEMON, on the
contrary, usually finds communities that are much larger
than the ground truth communities. This implies that both
algorithms extract structures from networks that bear little
resemblance to the natural formation of the communities.
However, we remark here that even though LC fails to
recognize the communities well on the synthetic data, it
performs better on real datasets, as we see in Table 4.

Table 4: Comparison with global algorithms on real datasets.

Figure 7: Comparison of the average F1 score on
LFR datasets with baseline algorithms. The left
corresponds to the datasets with mixing parameter μ =
0.1 and the right corresponds to μ = 0.3.

6.4 Empirical Comparison Between Synthetic and Real Data

Networks are not all similar, and we cannot assume that an
algorithm that works for finding communities in one network
will behave the same on other networks. Therefore, it is
important to understand how different types of networks
affect the behavior of algorithms.

Although our algorithm sustains a consistent performance on
both LFR benchmark graphs and real networks, we still want
to summarize and call attention to several subtle
differences here.

First, LEMON is less sensitive to the random walk step k
and the subspace dimension l on real networks than on LFR
benchmark graphs. In practice, fixing (k, l) to be (3, 3)
for real networks can ensure a good performance.

Second, LEMON is less sensitive to the seed set size on
real networks than on LFR benchmark graphs. In practice,
a seed set size of 3 can guarantee a good performance on
real networks. As for LFR, we set the seed set size to be
proportional to the community size.

Third, LEMON is more sensitive to high-degree seeds on
real networks than on LFR benchmark graphs. In LFR graphs,
the degree of a vertex is at most 50, whereas in some large
real networks such as YouTube, the degree of some vertices
exceeds 1000, making the degree distribution much more
skewed than that seen in LFR graphs. We expect that vertices
with unusually high degree in real networks have a stronger
power in controlling the trend for the probabilities to
spread out during the random walk, and thus have a higher
risk of entering some other neighboring communities. Such an
effect can be counterbalanced by putting less initial
probability on these “super cores”.

The above empirical analysis informs us that finding
communities in real networks seems to be less parameterized
than on synthetic datasets for our algorithm. This indicates
that our algorithm is better suited for uncovering naturally
well-formed communities than artificially constructed
communities in practice.


7. CONCLUSION 

The problem of identifying small community structure in
large networks has been gaining importance. In this paper,
we have presented a method for finding overlapping
communities by seeking a sparse vector in the span of local
spectra such that the seeds are in its support. To overcome
the drawbacks of traditional spectral clustering methods, we
propose a novel method to construct the local spectra based
on the singular vector approximations drawn from short
random walks. Our algorithm finds a small community in time
that depends on the size of the community, and it
consistently returns the result within seconds even for a
network with billions of vertices. We demonstrate the
effectiveness and efficiency of our method for discovering
communities on both synthetic and real-world datasets. As
the experimental results show, our algorithm achieves the
highest detection accuracy among the state-of-the-art
proposals.

Many other fundamentally important research questions 
remain to be addressed. First, the community detection 
algorithm based on local spectral clustering could be po¬ 
tentially applied to the membership detection problem, i.e., 
finding all the communities that an arbitrary vertex belongs 
to. Second, during the process of seed set expansion, we 
adopt the first low-conductance community as the target 
community, which usually yields a high resemblance to the 
ground truth community. It would also be interesting to 
look further into some larger low-conductance communities 
and see if a hierarchical structure exists. In this case, some 
large social group consisting of several small cliques is likely 
to be discovered. 